Chizat and Bach
We thank the reviewers for their time and constructive feedback on the submission, which we will incorporate to improve our manuscript.
We find that they are positive-definite as expected. "A Note on Lazy Training in Supervised Differentiable Programming" by Chizat and Bach is an important contribution, and we will absolutely cite and discuss it; note, however, that the results in question (Sec 2.2 in V1, V2) are restricted to single-hidden-layer networks. It is still an open research question to determine which factors drive these performance gaps. We will expand the discussion around this.
Directional convergence and alignment in deep learning
The above theories, in the finite-width setting, usually require the weights to stay close to initialization in certain norms. By contrast, practitioners run their optimization methods as long as their computational budget allows [Shallue et al., 2018], and if the data can be perfectly classified, the cross-entropy loss has no finite minimizer, so the weights grow without bound and travel far from initialization.
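As a small worked illustration of why perfectly classified data pushes the weights far from initialization (a standard argument, not quoted from the paper): for the logistic loss on separable data, scaling any separating weight vector strictly decreases the empirical risk, so the risk has no finite minimizer,
$$\widehat{\mathcal{R}}(c\,w) \;=\; \frac{1}{n}\sum_{i=1}^{n}\log\!\bigl(1+e^{-c\,y_i\,w^{\top}x_i}\bigr)\;\xrightarrow[c\to\infty]{}\;0, \qquad \text{while } \widehat{\mathcal{R}}(w)>0 \text{ for every finite } w,$$
and gradient-based methods that drive the loss toward zero therefore let $\|w\|\to\infty$; this is exactly the regime where directional convergence and alignment become the relevant questions.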
Reviews: Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
The paper was carefully proofread, well-structured, and very clear. The experiments were clearly described in detail and provided relevant results. Below we outline some detailed comments on the results. In particular, Chizat and Bach prove that the training of an NTK-parameterized network is closely modeled by "lazy training" (their terminology for a linearized model). This paper is not referenced in the related work section.
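For concreteness, the linearized model the reviewer refers to is the first-order Taylor expansion of the network around its initialization (a standard statement of lazy training, not quoted from either paper):
$$f_{\mathrm{lin}}(x;\theta)\;=\;f(x;\theta_0)\;+\;\nabla_\theta f(x;\theta_0)^{\top}(\theta-\theta_0),\qquad K_{\mathrm{NTK}}(x,x')\;=\;\nabla_\theta f(x;\theta_0)^{\top}\nabla_\theta f(x';\theta_0).$$
Lazy training means that gradient descent on $f$ stays uniformly close to gradient descent on $f_{\mathrm{lin}}$, so the trained network behaves like a kernel method with kernel $K_{\mathrm{NTK}}$.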
Regression as Classification: Influence of Task Formulation on Neural Network Features
Lawrence Stewart, Francis Bach, Quentin Berthet, Jean-Philippe Vert
Neural networks can be trained to solve regression problems by using gradient-based methods to minimize the square loss. However, practitioners often prefer to reformulate regression as a classification problem, observing that training on the cross entropy loss results in better performance. By focusing on two-layer ReLU networks, which can be fully characterized by measures over their feature space, we explore how the implicit bias induced by gradient-based optimization could partly explain the above phenomenon. We provide theoretical evidence that the regression formulation yields a measure whose support can differ greatly from that for classification, in the case of one-dimensional data. Our proposed optimal supports correspond directly to the features learned by the input layer of the network. The different nature of these supports sheds light on possible optimization difficulties the square loss could encounter during training, and we present empirical results illustrating this phenomenon.
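A minimal sketch of the two task formulations compared above, assuming a PyTorch setup; the architecture, bin count, and hyperparameters are illustrative choices of ours, not the authors' experimental configuration:

```python
# Sketch: the same two-layer ReLU network trained on a 1-D regression task,
# once with the square loss and once by binning the targets into classes and
# minimizing cross entropy. All hyperparameters are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 200).unsqueeze(1)            # one-dimensional inputs
y = torch.sin(3 * x)                                    # continuous regression targets

n_bins = 20                                             # classification reformulation: discretize y
edges = torch.linspace(float(y.min()), float(y.max()), n_bins + 1)
labels = torch.bucketize(y.squeeze(1), edges[1:-1])     # class index per sample

def two_layer(out_dim, width=256):
    return nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, out_dim))

tasks = [
    ("square loss", two_layer(1), nn.MSELoss(), y),
    ("cross entropy", two_layer(n_bins), nn.CrossEntropyLoss(), labels),
]

for name, net, loss_fn, target in tasks:
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(2000):
        opt.zero_grad()
        loss = loss_fn(net(x), target)
        loss.backward()
        opt.step()
    print(f"{name}: final training loss {loss.item():.4f}")
```

Inspecting the first-layer weights and biases of the two trained networks (the breakpoints of the learned ReLU units, i.e., the features in the abstract's sense) is then a direct way to see how the supports induced by the two formulations differ.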
Feature selection with gradient descent on two-layer networks in low-rotation regimes
This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), making use of margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely well-separated groups, and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner-layer weights are constrained to change in norm only and cannot rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions and other tools which will hopefully aid future work.
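To fix notation for the margins used as the core technique above, one common normalization for a 2-homogeneous two-layer ReLU network is the following (our choice of normalization for illustration; the paper's exact definitions may differ):
$$f(x;\theta)=\frac{1}{\sqrt m}\sum_{j=1}^{m}a_j\,\max\{0,\,w_j^{\top}x\},\qquad \gamma(\theta)=\frac{1}{\|\theta\|_2^{2}}\,\min_{1\le i\le n} y_i\,f(x_i;\theta),$$
where the NTK margin is the analogous quantity for the model linearized at initialization; the width and sample requirements quoted in the abstract scale inversely with these margins.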